ABSTRACT

The ArrayExpress Archive of Functional Genomics 
Data (http://www.ebi.ac.uk/arrayexpress) is one of 
three international functional genomics public data 
repositories, alongside the Gene Expression 
Omnibus at NCBI and the DDBJ Omics Archive, sup- 
porting peer-reviewed publications. It accepts data 
generated by sequencing or array-based techno- 
logies and currently contains data from almost a 
million assays, from over 30 000 experiments. The 
proportion of sequencing-based submissions has 
grown significantly over the last 2 years and has 
reached, in 2012, 15% of all new data. All data are 
available from ArrayExpress in MAGE-TAB format, 
which allows robust linking to data analysis and 
visualization tools, including Bioconductor and 
GenomeSpace. Additionally, R objects, for micro- 
array data, and binary alignment format files, for 
sequencing data, have been generated for a signifi- 
cant proportion of ArrayExpress data. 



INTRODUCTION 

The ArrayExpress Archive of Functional Genomics Data 
(1) is one of the major international repositories for func- 
tional genomics high throughput data, supporting publi- 
cations as well as various data generating consortia. It 
stores functional genomics data derived from high 
throughput sequencing (HTS) and microarray-based ex- 
periments. Users come to ArrayExpress to (i) find func- 
tional genomics experiments that might be relevant to 
their research; (ii) retrieve information describing these
experiments and the data associated with them; (iii)
retrieve data for inclusion in their own local data
warehouses or added-value databases; and (iv) submit their
own data supporting a peer-reviewed publication.

Once submitted, data may be kept in ArrayExpress as 
private for a limited period of time, typically during the 
peer-review process of the related publication. Upon
submission, each dataset is assigned an accession number
and access to the data is restricted to providers/reviewers
via a login system. The submitter specifies the release
date, and the data become public either when the accession
number associated with the data is cited in a publication or
at the set release date, whichever comes first.
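
The release rule described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the ArrayExpress implementation: the function names `public_from` and `is_public` are hypothetical, and it assumes that a missing citation date means the accession number has not yet been cited.

```python
from datetime import date
from typing import Optional


def public_from(release_date: date, first_citation: Optional[date] = None) -> date:
    """Date on which a submission becomes public.

    Assumption: a missing citation date means the accession number has
    not (yet) been cited in a publication.
    """
    if first_citation is None:
        return release_date
    # "Whichever comes first": the earlier of the two dates wins.
    return min(release_date, first_citation)


def is_public(today: date, release_date: date,
              first_citation: Optional[date] = None) -> bool:
    """True once the public date has been reached."""
    return today >= public_from(release_date, first_citation)
```

For example, a dataset with a release date of 1 June 2013 whose accession number is cited on 15 March 2013 would, under this sketch, be treated as public from 15 March 2013.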

All submissions are automatically checked for compliance
with the Minimum Information About a Microarray
Experiment (MIAME) (2) or Minimum Information
about Sequencing Experiments (MINSEQE,
http://www.fged.org/projects/minseqe/) guidelines, for
microarray and sequencing-based experiments, respectively.
The MIAME/MINSEQE scores associated with an ex- 
periment are displayed in the ArrayExpress interface and 
provided to submitters. 

In addition to the data submitted directly to 
ArrayExpress, data from the Gene Expression Omnibus 
(GEO) (3) are imported to provide users with a single
point of access to most of the functional genomics data available
in the public domain. All data are organized, and available
for download, in a structured and standardized format, 
MAGE-TAB (4), which also facilitates linking to open 
source analysis environments such as Bioconductor (5) 
and GenomeSpace (http://www.genomespace.org). A 
format conversion tool, from GEO SOFT to MAGE- 
TAB (6), is run on all GEO HTS and microarray data. 
The conversion is successful in 83% of cases. It may fail
for various reasons, including failure to parse SOFT files
correctly or failure to retrieve the associated data files,
and we are constantly working with GEO to increase the
success rate. All HTS data are
exchanged with GEO and a data sharing agreement with 
the DDBJ Omics Archive is also in place (7). 
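
Because MAGE-TAB is tab-delimited text, its sample and data relationship (SDRF) table can be read with standard tools. The following Python sketch illustrates this under stated assumptions: the two-sample table is invented for illustration, the helper names `read_sdrf` and `annotations` are hypothetical, and real SDRF files contain many more columns (and may repeat column headers, which this simplified reader does not handle).

```python
import csv
import io
import re

# Invented two-sample SDRF fragment (tab-delimited), for illustration only.
SDRF_TEXT = (
    "Source Name\tCharacteristics[organism]\t"
    "Characteristics[organism part]\tFactor Value[organism part]\n"
    "sample_1\tHomo sapiens\tliver\tliver\n"
    "sample_2\tHomo sapiens\tkidney\tkidney\n"
)

# Matches annotation columns such as "Characteristics[organism]".
ANNOTATION_COLUMN = re.compile(r"^(Characteristics|Factor Value)\s*\[(.+)\]$")


def read_sdrf(handle):
    """Read a tab-delimited SDRF table into a list of row dictionaries.

    Simplification: repeated column headers would overwrite one another
    in the dictionary; a fuller parser would keep column positions.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    return [dict(zip(header, fields)) for fields in reader]


def annotations(row):
    """Collect {(column kind, attribute name): value} for one sample row."""
    found = {}
    for column, value in row.items():
        match = ANNOTATION_COLUMN.match(column)
        if match:
            found[(match.group(1), match.group(2))] = value
    return found


rows = read_sdrf(io.StringIO(SDRF_TEXT))
for row in rows:
    print(row["Source Name"], annotations(row))
```

Each row corresponds to one sample, so the sample characteristics and experimental variables can be passed directly to downstream analysis code.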

For all experiments, the column labels describing the 
sample (e.g. disease) and its characteristics (e.g. type II 
diabetes) are mapped to the EBI's Experimental Factor
Ontology (EFO) (8) and the data loaded into ArrayExpress. 
This allows consistent query results to be returned from 
direct submissions as well as imported data. As data are 
curated for Gene Expression Atlas use (9), they are 
reloaded into ArrayExpress with enriched annotation. 

The ArrayExpress user interface allows users to search 
for experiments of interest by keywords and ontology 
terms, which enable semantically driven searches of the 
experimental metadata; for instance searching with the 
EFO term 'cancer' will also find experiments investigating 
'leukemia' even if 'cancer' is not mentioned explicitly. 
Both US and UK spellings are supported.
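
The idea behind such ontology-driven searches can be illustrated with a small Python sketch. It is not the ArrayExpress search implementation: the term hierarchy below is a toy fragment invented for illustration (not taken from EFO), the experiment identifiers and descriptions are hypothetical, and matching is done by simple substring comparison.

```python
# Toy parent -> children hierarchy; an invented fragment, not real EFO content.
CHILDREN = {
    "cancer": ["leukemia", "carcinoma"],
    "leukemia": ["acute myeloid leukemia"],
}

# Spelling variants (US -> UK), also invented for this sketch.
SPELLING_VARIANTS = {"leukemia": ["leukaemia"]}


def expand_term(term):
    """Return the term, all of its descendants and their spelling variants."""
    expanded = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current in expanded:
            continue
        expanded.add(current)
        stack.extend(CHILDREN.get(current, []))
    for known in list(expanded):
        expanded.update(SPELLING_VARIANTS.get(known, []))
    return expanded


def search(term, experiments):
    """Return identifiers whose description mentions any expanded term."""
    terms = expand_term(term.lower())
    return sorted(
        identifier
        for identifier, description in experiments.items()
        if any(t in description.lower() for t in terms)
    )


# Hypothetical experiment descriptions.
EXPERIMENTS = {
    "EXP-1": "Transcription profiling of leukaemia patient samples",
    "EXP-2": "ChIP-seq of a carcinoma cell line",
    "EXP-3": "Expression in pancreatic islets from type II diabetes donors",
}

print(search("cancer", EXPERIMENTS))
```

In this sketch, a query for 'cancer' returns EXP-1 and EXP-2 although neither description contains the word 'cancer', and the UK spelling 'leukaemia' is matched through the variant table.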

DATA GROWTH TO A MILLION ASSAYS 

Over the last 2 years, the database content has grown 
from 13 000 experiments and 370 000 assays, to over
30 000 experiments and almost a million assays.
Approximately 20% of the data were submitted directly 
to ArrayExpress; the rest are imported from GEO weekly. 

Although HTS-based experiments account for only 6%
of the entire database content, the proportion of new HTS
submissions has been growing rapidly over the last
few years, from 2% in 2009 to 6% in 2010, 7% in 2011
and 15% in 2012. Nevertheless, assays associated with
HTS-based experiments still make up only 3% of all assays,
reflecting the fact that HTS experiments are typically
smaller than microarray-based experiments. Broken down
by application, 50% of the HTS experiments used RNA-seq
only, 32% ChIP-seq only, and the remaining experiments
either used more than one application or used DNA-seq for
genotyping, copy number variation detection or methylation
profiling.
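
The statement that HTS experiments are typically smaller can be checked with a rough back-of-the-envelope calculation. The Python sketch below uses only the figures quoted in this section, rounded to 30 000 experiments and one million assays; because the text gives these totals as 'over 30 000' and 'almost a million', the results are approximations rather than exact database statistics.

```python
# Rounded totals from the text (assumptions: exact values are not given).
TOTAL_EXPERIMENTS = 30_000   # "over 30 000 experiments"
TOTAL_ASSAYS = 1_000_000     # "almost a million assays"

HTS_EXPERIMENT_PERCENT = 6   # HTS share of experiments
HTS_ASSAY_PERCENT = 3        # HTS share of assays

hts_experiments = TOTAL_EXPERIMENTS * HTS_EXPERIMENT_PERCENT // 100
hts_assays = TOTAL_ASSAYS * HTS_ASSAY_PERCENT // 100

other_experiments = TOTAL_EXPERIMENTS - hts_experiments
other_assays = TOTAL_ASSAYS - hts_assays

# Mean number of assays per experiment in each group.
hts_mean = hts_assays / hts_experiments
other_mean = other_assays / other_experiments

# Share of HTS experiments that are neither RNA-seq only nor ChIP-seq only.
remaining_percent = 100 - 50 - 32

print(f"HTS: about {hts_mean:.1f} assays per experiment")
print(f"Other: about {other_mean:.1f} assays per experiment")
print(f"Remaining HTS applications: {remaining_percent}%")
```

Under these rounded inputs, an HTS experiment has on average roughly 17 assays, compared with roughly 34 for the remaining experiments, and 18% of HTS experiments fall outside the RNA-seq-only and ChIP-seq-only categories.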

For HTS data, ArrayExpress stores processed data and 
metadata describing the sample properties and the experi- 
mental design, including experimental variables and 
protocols, whereas raw sequence data are stored in the 
European Nucleotide Archive (ENA) (10) and linked
from ArrayExpress. For datasets that require controlled 
access, the raw sequence data are stored in, and should be 
submitted directly to, the European Genome-phenome 
Archive (EGA - www.ebi.ac.uk/ega). 

LINKS TO DATA ANALYSIS TOOLS 

Approximately 50 GB of data are downloaded every day 
from ArrayExpress, by an average of 1000 different users. 
To simplify the interface between ArrayExpress and ana- 
lytical platforms, we are now providing links to popular 
analytical tools such as Bioconductor and GenePattern 
(11), as well as developing robust internal pipelines for 
HTS data processing. 



To facilitate loading microarray data from 
ArrayExpress into Bioconductor, we have pre-generated 
R objects for 16 250 out of 25 000 gene expression
microarray experiments with raw data files available. A revised
version of the Bioconductor package ArrayExpress (12) is 
used with default parameters. The package has been 
updated to support popular data formats including 
Affymetrix and Agilent. More than 85% of Affymetrix 
data in the repository have downloadable R objects. 
Older submissions, other technologies and experiments 
with only processed data available can still be loaded in 
R, but require user-specified settings for the package to 
recognize the data format, so loading must be supervised 
by a user. All pregenerated R objects are now available 
through the ArrayExpress interface and can be easily 
loaded into Bioconductor for downstream analysis. 
More R objects will be created for experiments in 
ArrayExpress as more data arrive, and the R package 
will be maintained and extended for this purpose. 

Direct links are now provided to GenomeSpace
(http://www.genomespace.org), a data analysis environment that
makes it possible for users to move data smoothly between 
popular bioinformatics tools. From ArrayExpress, the 
user can, with a single click, load a dataset into 
GenomeSpace, provided that he/she has a registered 
account with GenomeSpace. Once logged in, the user 
will be able to utilize the data analysis tools available 
through GenomeSpace, including GenePattern, Galaxy 
(13) and Cytoscape (14), to perform data analysis. 

For HTS data, the Bioconductor package 
ArrayExpressHTS (15) and the R-workbench
(http://www.ebi.ac.uk/Tools/rcloud/) are used to generate
binary alignment (BAM) format files (16). BAM files 
contain sequence alignment data and can be displayed 
using the Ensembl genome browser (17), through a 
direct link from ArrayExpress. So far approximately 
1200 BAM files are available for 125 RNA-seq experi- 
ments, for 14 different species, with over half of these 
data studying human and a quarter mouse. The BAM 
file generation has been done for experiments for which: 
(i) the sample-data relationship information is available 
and contains details such as the library strategy and the 
experiment type (i.e. RNA-seq); (ii) the raw sequence 
reads (FASTQ files) are deposited in ENA and a valid 
link to the ENA entry is present; and (iii) the annotation 
for the reference genome is available in Ensembl. 
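
The three criteria above amount to a simple eligibility filter, sketched here in Python. This is one possible reading of the criteria, not the pipeline's actual code: the `HtsExperiment` record, its field names and the example identifiers are hypothetical, and criterion (i) is interpreted as requiring a recorded library strategy together with an experiment type of RNA-seq.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass
class HtsExperiment:
    """Hypothetical record; field names are not ArrayExpress identifiers."""
    accession: str
    library_strategy: Optional[str]  # from the sample-data relationship
    experiment_type: Optional[str]   # e.g. "RNA-seq"
    ena_link_valid: bool             # FASTQ files in ENA with a valid link
    species: str


def eligible_for_bam(exp: HtsExperiment, annotated_species: Set[str]) -> bool:
    """Apply criteria (i)-(iii) from the text to one experiment."""
    # (i) sample-data relationship details are present and describe RNA-seq
    has_sample_data_details = (
        exp.library_strategy is not None and exp.experiment_type == "RNA-seq"
    )
    # (ii) raw reads are deposited in ENA and the link is valid
    has_raw_reads = exp.ena_link_valid
    # (iii) a reference genome annotation exists for the species
    has_reference_annotation = exp.species in annotated_species
    return has_sample_data_details and has_raw_reads and has_reference_annotation


# Hypothetical inputs for illustration.
ANNOTATED = {"Homo sapiens", "Mus musculus"}
complete = HtsExperiment("EXAMPLE-1", "RNA-Seq", "RNA-seq", True, "Homo sapiens")
no_reads = HtsExperiment("EXAMPLE-2", "RNA-Seq", "RNA-seq", False, "Mus musculus")

print(eligible_for_bam(complete, ANNOTATED))
print(eligible_for_bam(no_reads, ANNOTATED))
```

An experiment that fails any one of the three checks, such as the second example with no valid ENA link, is excluded from BAM file generation in this sketch.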

In addition, 3000 datasets from ArrayExpress have been 
analysed and the results of this analysis are presented 
through the Gene Expression Atlas (9), a separate EBI 
database, which helps users to (i) find out whether the 
expression of a gene (or a group of genes with a 
common gene attribute, e.g. GO term) change(s) across 
all the experiments or (ii) discover which genes are differ- 
entially expressed in a particular biological condition of 
interest. 

CONTINUOUS USER INTERFACE IMPROVEMENTS 

The ArrayExpress user interface has been continuously
improved since the repository was established in
2003 (18). Recent additions include the sample-data
relationship viewer (Figure 1), which provides an overview of
all samples used in an experiment and their characteristics,
the experimental variables (factors) investigated and the
data files associated with each sample.

Figure 1. Sample-data relationship viewer for Experiment E-MTAB-513 (RNA-Seq of human individual tissues and mixture of 16 tissues (Illumina Body Map); 19 samples). This view provides information on sample characteristics and experimental
variables that are fundamental to understanding the results obtained in the experiment. Generally, each row corresponds to a sample. Columns include
sample characteristics and their relationship to the resulting data files, providing a quick view over the structure of the experiment and the biological
questions that the authors addressed. The last column provides links to raw sequence data files available in ENA, and BAM files that can be
visualized in the Ensembl genome browser.

Other improvements include (i) improved browsing and
querying of array designs; (ii) specific features for HTS
data display; (iii) better organization of the species
drop-down filter; and (iv) improved performance for
retrieving and visualizing large experiments.

The ArrayExpress user documentation has recently 
been updated and several online courses, covering how 
to search, interpret and submit data to ArrayExpress, 
can be found on the EBI e-Learning portal, Train online
(http://www.ebi.ac.uk/training/online/).



FUTURE DEVELOPMENTS 

We are currently developing a new submission tool, 
optimized for supporting HTS data submissions; this 
new tool is based on the community developed annotation 
tool Annotare (19) and will be released in 2013. 

Like all other major EBI data resources, ArrayExpress 
is working toward deeper integration in the overall EBI 
infrastructure, in particular with the BioSample Database
(20), the Gene Expression Atlas and the sequence
databases ENA, EGA and Ensembl. We will continue this
integration effort to ensure that our users can obtain a
systems level view of the data stored at EBI by easily 
navigating through our resources. 